While my previous posts outlined methods for conducting EDA for numeric data as well as categorical data, this post focuses on EDA for images.

What is Exploratory Data Analysis (EDA)?

Again, since all learning is repetition, EDA is a process by which we 'get to know' our data by conducting basic descriptive statistics and visualizations.

Why is it done for images?

We need to know:

  • how many images we have
  • if we're doing supervised learning, whether they are labeled appropriately
  • their format (i.e., size and color)

How do we do it in Python?

Step 1: Frame the Problem

"Is it possible to determine the minimum age a reader should be for a given book based solely on the cover?"

Step 2: Get the Data

As mentioned in my previous posts, I sourced labeled training data from Common Sense Media's Book Reviews by scraping and saving the target pages using BeautifulSoup, and then extracted and saved the book covers into a separate folder.

In the end, I was able to use over 5000 covers for training and testing purposes, but today we'll only work with a sample of the covers which can be downloaded from here.

Step 3: Explore the Data to Gain Insights (i.e. EDA)

As always, import the essential libraries, then load the data.

import pandas as pd
import numpy as np
import os
import cv2
from PIL import Image
import ipyplot

How large is our sample?

IMAGES_PATH = "data/covers/"
image_files = list(os.listdir(IMAGES_PATH))
full_file_paths = [IMAGES_PATH+image for image in image_files] 
print("Number of image files: {}".format(len(image_files)))
Number of image files: 561
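
Since the format of the images is one of the things we want to know, a quick tally of file extensions can sit next to the file count. The snippet below is only a minimal sketch: `count_extensions` is a helper name I made up for illustration, and the four file names are taken from the listings in this post rather than read from disk.

```python
from collections import Counter
from pathlib import Path


def count_extensions(file_names):
    """Count files per lowercase extension, e.g. '.jpg' or '.png'."""
    return Counter(Path(name).suffix.lower() for name in file_names)


# A few file names from this post; in practice you would pass image_files
example_files = [
    "13_dance-of-thieves-book-1.jpg",
    "5_yukon-sled-dog.png",
    "3_fetch.jpeg",
    "8_moonpenny-island.jpg",
]
extension_counts = count_extensions(example_files)
print(extension_counts)
```

An unexpected extension in the tally (or an empty one) is usually a sign that something other than a cover has ended up in the folder.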

What does our target look like?

To answer that question, we can create a data frame of the book titles and the target ages in our sample, and then plot the target.

Since I scraped the data, I know that the beginning of each file name is the target age (e.g., 13 is the minimum age for the file '13_dance-of-thieves-book-1.jpg'), so we can create a data frame of the:

  • file names
  • full paths
  • and a target column called age by splitting the file name on the underscore and extracting the first element
data = {'files':image_files, 'full_path':full_file_paths}
df = pd.DataFrame(data=data)
df['age'] = df['files'].str.split("_").str[0].astype('int')
df.head()
   files                                              full_path                                          age
0  13_dance-of-thieves-book-1.jpg                     data/covers/13_dance-of-thieves-book-1.jpg         13
1  11_ways-to-live-forever.jpg                        data/covers/11_ways-to-live-forever.jpg            11
2  13_this-time-will-be-different.jpg                 data/covers/13_this-time-will-be-different.jpg     13
3  10_the-care-and-keeping-of-you-2-the-body-book...  data/covers/10_the-care-and-keeping-of-you-2-t...  10
4  8_moonpenny-island.jpg                             data/covers/8_moonpenny-island.jpg                 8
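
One caveat with splitting on the underscore: `astype('int')` will raise an error if the folder contains a file that does not start with an age (a hidden system file, for instance). As a hedged alternative, the sketch below pulls the age out with a regular expression and returns `None` when the prefix is missing. The names `AGE_PREFIX` and `parse_age` are my own, not part of the original workflow, and `.DS_Store` stands in for any stray file.

```python
import re

# One or two digits followed by an underscore at the start of the file name
AGE_PREFIX = re.compile(r"^(\d{1,2})_")


def parse_age(file_name):
    """Return the leading age as an int, or None if there is no '<age>_' prefix."""
    match = AGE_PREFIX.match(file_name)
    return int(match.group(1)) if match else None


names = ["13_dance-of-thieves-book-1.jpg", "8_moonpenny-island.jpg", ".DS_Store"]
ages = [parse_age(name) for name in names]
print(ages)  # [13, 8, None]
```

Rows where the age comes back as `None` can then be inspected and dropped before building the target column.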

Now we can plot the age feature.

df['age'].plot(kind="hist",
               bins=range(2, 19),  # edges 2..18 so that ages 16 and 17 get separate bins
               figsize=(24, 10),
               xticks=range(2, 18),
               fontsize=16);

Thankfully, the distribution in the plot above is nearly identical to that of the full data set (see this post), so all is good and we can continue.

What do our covers look like?

We know the general shape of our target, but let's get a feel for what the inputs (i.e. the book covers) look like by using the IPyPlot package.

To do so, we convert the image paths and the target ages to NumPy arrays:

images = df['full_path'].to_numpy()
labels_int = df['age'].to_numpy()

and then pass them as arguments to the plot_class_representations function which will return the first instance of each of our targets.

In other words, the function will display the first book rated for 2-year-olds, 3-year-olds, 4-year-olds, and so on until all levels of the target are represented.

ipyplot.plot_class_representations(images=images, labels=labels_int, force_b64=True)

[Output: one cover per target age, from 2 (data/covers/2_ten-little-caterpillars.jpg) through 17 (data/covers/17_pretty-dead.jpg)]

:thinking: There seems to be a correlation between the dimensions of the books and the target age; books for younger readers are more square whereas books for older readers are more rectangular.

Let's investigate that further by plotting multiple covers per age which we can do by using the plot_class_tabs function.

ipyplot.plot_class_tabs(images=images, labels=labels_int, max_imgs_per_tab=8, force_b64=True)

[Output: one tab per target age (2 through 17), each showing up to eight covers; age 17 has a single cover, data/covers/17_pretty-dead.jpg]

:thinking: Hmmmmm. Could be true but we'll need more evidence to be certain.

To test this hypothesis, we can look for a correlation between the dimensions of the covers and the target age.

What are the sizes and channels of our covers?

The size of a cover is the height and width of its image, and the number of channels tells us whether the cover is in color (three channels) or grayscale (one channel).

We can compute the dimensions of the covers by extracting the height and width from the shape of each image.
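
A note on channels: a grayscale image stored as an array has shape (height, width) with no third dimension, while a color image has shape (height, width, channels). Also keep in mind that `cv2.imread` with its default flag always returns a three-channel array, so a grayscale file would only show up as single-channel if it were read with `cv2.IMREAD_UNCHANGED`. The sketch below works on shape tuples only, so it runs without any images; `channel_count` is a hypothetical helper name.

```python
def channel_count(shape):
    """Return the number of channels implied by an image array's shape tuple."""
    if len(shape) == 2:  # (height, width): single-channel grayscale
        return 1
    if len(shape) == 3:  # (height, width, channels)
        return shape[2]
    raise ValueError(f"unexpected image shape: {shape}")


print(channel_count((255, 170, 3)))  # a color cover
print(channel_count((255, 170)))     # a grayscale array
```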

First create a list of arrays for the covers:

covers = [cv2.imread(IMAGES_PATH+image) for image in image_files]

Congratulations! All of our covers are now stored as a list of arrays of pixel data so we can use shape to inspect the dimensions of our covers.
For example:

covers[0].shape
(255, 170, 3)
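
We can also display the first cover to check that it loaded correctly:

import matplotlib.pyplot as plt

sample = df.iloc[0, 1]  # full path of the first cover
sample_img = cv2.imread(sample, 1)  # flag 1 loads the image in color, in BGR channel order
sample_img = cv2.cvtColor(sample_img, cv2.COLOR_BGR2RGB)  # matplotlib expects RGB

plt.imshow(sample_img, interpolation='bicubic')
plt.xticks([])  # hide tick values on the x axis
plt.yticks([])  # hide tick values on the y axis
plt.show()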

With the covers loaded, we can extract the width, height, and number of channels of each one and add them to the data frame:

# shape is (height, width, channels) for a color image loaded with cv2.imread
height = [cover.shape[0] for cover in covers]
width = [cover.shape[1] for cover in covers]
channels = [cover.shape[2] for cover in covers]

df['width'] = width
df['height'] = height
df['channels'] = channels
df.head()
   files                                              full_path                                          age  width  height  channels
0  13_dance-of-thieves-book-1.jpg                     data/covers/13_dance-of-thieves-book-1.jpg         13   170    255     3
1  11_ways-to-live-forever.jpg                        data/covers/11_ways-to-live-forever.jpg            11   170    255     3
2  13_this-time-will-be-different.jpg                 data/covers/13_this-time-will-be-different.jpg     13   170    255     3
3  10_the-care-and-keeping-of-you-2-the-body-book...  data/covers/10_the-care-and-keeping-of-you-2-t...  10   170    255     3
4  8_moonpenny-island.jpg                             data/covers/8_moonpenny-island.jpg                 8    170    255     3
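
With width and height in the data frame, one way to test the "square versus rectangular" hunch is to compute an aspect ratio and see how it moves with age. The snippet below is a sketch under loud assumptions: the column names match the ones above, but the numbers in `demo` are made up purely for illustration and are not results from the covers. Note too that the first five covers above all measure 170 x 255, so if the saved images were resized to a common size, the ratio will carry no signal.

```python
import pandas as pd

# Hypothetical rows: same column names as the post, invented values
demo = pd.DataFrame({
    "age":    [2, 3, 5, 9, 13, 16],
    "width":  [250, 240, 220, 180, 170, 170],
    "height": [250, 250, 255, 255, 255, 255],
})

# A ratio of 1.0 is a square cover; smaller values are taller, narrower covers
demo["aspect_ratio"] = demo["width"] / demo["height"]

# Average ratio per age, plus the Pearson correlation between age and ratio
mean_ratio_by_age = demo.groupby("age")["aspect_ratio"].mean()
correlation = demo["age"].corr(demo["aspect_ratio"])

print(mean_ratio_by_age)
print(round(correlation, 3))
```

On the real data, swapping `demo` for `df` would give the same two summaries; a correlation near zero would suggest the pattern seen in the thumbnails is not backed up by the dimensions.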

Summary

  • :ballot_box_with_check: numeric data
  • :ballot_box_with_check: categorical data
  • :black_square_button: images (book covers)

Two down; one to go!

Going forward, my key points to remember are:

What type of categorical data do I have?

There is a huge difference between ordered data (e.g. "bad", "good", "great") and truly nominal data that has no order or ranking, like different genres; just because I prefer science fiction to fantasy doesn't mean it actually is superior.

Are missing values really missing?

Several of the features had missing values which were, in fact, not truly missing; for example, the award and awards features were mostly blank for a very good reason: the book didn't win one of the four awards recognized by Common Sense Media.

In conclusion, both of the points above can be summarized simply as "be sure to get to know your data."

Happy coding!

Footnotes


2. Be sure to check out this excellent post by Jeff Hale for more examples on how to use this package



4. Big Thank You to Chaim Gluck for providing this tip